A unified view of TD algorithms, introducing Full-gradient TD and Equi-gradient descent TD
Authors
Abstract
This paper addresses the problem of policy evaluation in Markov Decision Processes using linear function approximation. It provides a unified view of algorithms such as TD(λ), LSTD(λ), iLSTD, and residual-gradient TD: all of them minimize a gradient function, and they differ in the form of this function and in their means of minimizing it. Two new schemes are introduced in this framework: Full-gradient TD, which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms form a new intermediate family with the interesting property of making much better use of the samples than TD while keeping a gradient-descent scheme, which is useful for complexity issues and for optimistic policy iteration.

1 The policy evaluation problem

A Markov Decision Process (MDP) describes a dynamical system and an agent. The system is described by its state s ∈ S. In discrete time, the agent can apply at each time step an action u ∈ U, which drives the system to a state s′ = u(s) at the next time step; u is generally non-deterministic. A reward r ∈ R ⊂ ℝ is associated with each transition. A policy π is a function that associates with any state of the system an action taken by the agent. Given a discount factor γ, the value function v of a policy π associates with any state the expected discounted sum of rewards received when applying π from that state for an infinite time:

v(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
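To ground these definitions, here is a minimal sketch of TD(λ) policy evaluation with linear function approximation, the baseline member of the family discussed above. It assumes a hypothetical episodic environment interface (env.reset, env.step), a fixed policy, and a feature map phi; these names are illustrative stand-ins, not part of the paper.

    import numpy as np

    def td_lambda(env, policy, phi, n_features,
                  gamma=0.99, lam=0.9, alpha=0.01, n_episodes=100):
        """Estimate v(s) ~ theta . phi(s) for a fixed policy with TD(lambda).

        The weight vector theta is moved along the semi-gradient of the
        TD error, accumulated through an eligibility trace z.
        (env, policy, phi are assumed interfaces, not from the paper.)
        """
        theta = np.zeros(n_features)
        for _ in range(n_episodes):
            s = env.reset()                    # hypothetical interface
            z = np.zeros(n_features)           # eligibility trace
            done = False
            while not done:
                s_next, r, done = env.step(policy(s))
                # TD error: delta = r + gamma * v(s') - v(s)
                v_s = theta @ phi(s)
                v_next = 0.0 if done else theta @ phi(s_next)
                delta = r + gamma * v_next - v_s
                # decay the trace, then add the current feature vector
                z = gamma * lam * z + phi(s)
                # gradient-descent step on the weights
                theta += alpha * delta * z
                s = s_next
        return theta

Each update uses the current sample once and discards it, which is exactly the sample inefficiency, relative to least-squares methods such as LSTD(λ), that the paper's intermediate family aims to reduce while retaining this gradient-descent form.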
Related Papers
Gradient Temporal-Difference Learning Algorithms (University of Alberta)
We present a new family of gradient temporal-difference (TD) learning methods with function approximation whose complexity, both in terms of memory and per-time-step computation, scales linearly with the number of learning parameters. TD methods are powerful prediction techniques, and with function approximation form a core part of modern reinforcement learning (RL). However, the most popular T...
Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation
We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(λ), Q-learning, and Sarsa, have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as nonlinear function approximat...
Image Restoration with Two-Dimensional Adaptive Filter Algorithms
Two-dimensional (TD) adaptive filtering is a technique that can be applied to many image and signal processing applications. This paper extends the one-dimensional adaptive filter algorithms to TD structures and establishes novel TD adaptive filters. Based on this extension, the TD variable step-size normalized least mean squares (TD-VSS-NLMS), the TD-VSS affine projection algorithms (...
Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation
Sutton, Szepesvári and Maei (2009) recently introduced the first temporal-difference learning algorithm that is compatible with both linear function approximation and off-policy training and whose complexity scales only linearly in the size of the function approximator. Although their "gradient temporal difference" (GTD) algorithm converges reliably, it can be very slow compared to conventional linear...
On Convergence of some Gradient-based Temporal-Differences Algorithms for Off-Policy Learning
We consider off-policy temporal-difference (TD) learning methods for policy evaluation in Markov decision processes with finite spaces and discounted reward criteria, and we present a collection of convergence results for several gradient-based TD algorithms with linear function approximation. The algorithms we analyze include: (i) two basic forms of two-time-scale gradient-based TD algorithms,...